Current Issue: October - December 2016 | Issue Number: 4 | Articles: 6
This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply, for every input frame, the feature transform that best compensates for the background. The implementation uses a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. With this topology, the proposed adaptation requires no ground truth or prior knowledge about the background in each frame, as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining the models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of both background effects and speaker. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark in which to evaluate the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2-3%....
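The background-dependent compensation that the abstract describes can be pictured with a small sketch: a set of background-specific CMLLR-style feature transforms is applied to each frame and the candidate that scores best under an acoustic model is kept. This is only an illustration of the idea; the actual aCMLLR method maximises the log-likelihood of the whole decoded utterance through the expanded parallel-branch HMM topology rather than choosing greedily per frame, and all names below (gmm_loglik, apply_cmllr, adapt_frames) are assumptions, not the authors' code.

```python
# Minimal sketch (not the paper's implementation): per-frame selection among
# background-specific CMLLR-style transforms, scored by a diagonal-covariance GMM.
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one frame x under a diagonal-covariance GMM."""
    diff = x - means                                   # (M, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances, axis=1))
    return np.logaddexp.reduce(log_comp)

def apply_cmllr(x, A, b):
    """CMLLR feature transform: x' = A x + b."""
    return A @ x + b

def adapt_frames(frames, transforms, gmm):
    """For each frame, keep the transform whose output is most likely under the model.
    `transforms` is a list of (A, b) pairs, one per background condition."""
    out = []
    for x in frames:
        candidates = [apply_cmllr(x, A, b) for A, b in transforms]
        scores = [gmm_loglik(c, *gmm) for c in candidates]
        out.append(candidates[int(np.argmax(scores))])
    return np.vstack(out)
```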
In audio communication systems, the perceptual quality of the reproduced audio, such as the naturalness of the sound, is limited by the available audio bandwidth. In this paper, a wideband to super-wideband audio bandwidth extension method is proposed using an ensemble of recurrent neural networks. The feature space of wideband audio is first divided into different regions through clustering. For each region of the feature space, a specific recurrent neural network with a sparsely connected hidden layer, referred to as an echo state network, is employed to dynamically model the mapping between wideband audio features and the high-frequency spectral envelope. In the following step, the outputs of the multiple echo state networks are weighted and fused by means of network ensembling in order to refine the estimate of the high-frequency spectral envelope. Finally, combined with the high-frequency fine spectrum obtained by spectral translation, the proposed method effectively extends the bandwidth of wideband audio to super wideband. Objective evaluation results show that the proposed method outperforms the hidden Markov model-based bandwidth extension method on average in terms of both static and dynamic distortions. Subjective listening tests indicate that the proposed method improves the auditory quality of wideband audio signals and outperforms the reference method....
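The clustering-plus-ensemble idea can be sketched as follows: one echo state network per feature-space region, with each network's output weighted by how close the current wideband feature vector is to the region's centroid. The reservoir size, the sparsity level and the distance-based softmax weights are illustrative assumptions, not the paper's configuration, and a real system would also train the readout weights (e.g. by ridge regression).

```python
# Illustrative sketch only: an echo state network (ESN) per feature-space cluster,
# with cluster closeness used as the fusion weight for the high-band envelope estimate.
import numpy as np

class ESN:
    def __init__(self, n_in, n_res, n_out, rho=0.9, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)  # sparse reservoir
        self.W = W * (rho / max(abs(np.linalg.eigvals(W))))   # scale to spectral radius rho
        self.W_in = rng.standard_normal((n_res, n_in)) * 0.1
        self.W_out = np.zeros((n_out, n_res))                 # readout; would be learned from data
        self.state = np.zeros(n_res)

    def step(self, u):
        self.state = np.tanh(self.W_in @ u + self.W @ self.state)
        return self.W_out @ self.state                         # high-frequency envelope estimate

def fuse(wideband_feat, esns, centroids):
    """Weight each cluster-specific ESN output by its softmax closeness to the cluster centroid."""
    d = np.array([np.linalg.norm(wideband_feat - c) for c in centroids])
    w = np.exp(-d) / np.exp(-d).sum()
    outs = np.stack([esn.step(wideband_feat) for esn in esns])
    return w @ outs                                            # fused high-frequency envelope
```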
We present a content-based automatic music tagging algorithm using fully convolutional neural networks (FCNs). We evaluate different architectures consisting of 2D convolutional layers and subsampling layers only. In the experiments, we measure the AUC-ROC scores of architectures with different complexities and input types on the MagnaTagATune dataset, where a 4-layer architecture shows state-of-the-art performance with mel-spectrogram input. Furthermore, we evaluate the performance of the architectures while varying the number of layers on a larger dataset (Million Song Dataset), and find that deeper models outperform the 4-layer architecture. The experiments show that the mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data....
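A minimal sketch of the kind of 4-layer fully convolutional tagger the abstract refers to, written in Keras. The 96-band mel-spectrogram input, the 1366-frame length, the filter counts and the 50-tag output are assumptions chosen for illustration, not the exact published architecture.

```python
# Sketch of a 4-layer FCN tagger over a mel-spectrogram, multi-label sigmoid output.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fcn(n_mels=96, n_frames=1366, n_tags=50):
    m = models.Sequential([
        layers.Conv2D(32, 3, padding="same", activation="relu",
                      input_shape=(n_mels, n_frames, 1)),
        layers.MaxPooling2D((2, 4)),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D((2, 4)),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D((2, 4)),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.GlobalMaxPooling2D(),                    # collapse time-frequency, keep channels
        layers.Dense(n_tags, activation="sigmoid"),     # independent sigmoids for multi-label tags
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
    return m
```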
In multichannel spatial audio coding (SAC), accurate representation of virtual sound sources and efficient compression of spatial parameters are the key to perfect reproduction of spatial sound effects in 3D space. The just noticeable difference (JND) characteristics of the human auditory system can be used to remove spatial perceptual redundancy efficiently when quantizing spatial parameters. However, the quantization step sizes of spatial parameters in current SAC methods are not well correlated with these JND characteristics, which results in either spatial perceptual distortion or inefficient compression. A JND-based spatial parameter quantization (JSPQ) method is proposed in this paper, in which the quantization step sizes of the spatial parameters are assigned according to the JND values of azimuths over a full circle. The quantization codebook size of JSPQ was 56.7% smaller than that of one of the quantization codebooks of MPEG Surround. The average bit rate reduction on spatial parameters for standard 5.1-channel signals reached approximately 13% compared with MPEG Surround, while preserving comparable subjective spatial quality....
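One way to picture a JND-driven quantizer is to let the reproduction levels of the azimuth parameter follow a JND curve around the circle, so that directions where listeners are more sensitive get finer steps. The jnd_deg curve below is a placeholder assumption (finer in front, coarser to the sides and rear), not the JND data used in the paper. Because the step size tracks sensitivity, coarse steps are used only where they are inaudible, which is the mechanism behind the reported codebook and bit-rate reductions.

```python
# Sketch of JND-based non-uniform quantization of an azimuth spatial parameter.
import numpy as np

def jnd_deg(azimuth_deg):
    """Hypothetical localization JND in degrees as a function of source azimuth."""
    front = abs(((azimuth_deg + 180) % 360) - 180)      # angular distance from the front (0 deg)
    return 1.0 + 9.0 * np.sin(np.radians(front / 2))    # ~1 deg in front, up to ~10 deg behind

def build_codebook():
    """Walk around the circle in JND-sized steps to obtain the reproduction levels."""
    levels, a = [], 0.0
    while a < 360.0:
        levels.append(a)
        a += jnd_deg(a)
    return np.array(levels)

def quantize_azimuth(azimuth_deg, codebook):
    """Map an azimuth to the index of the nearest reproduction level (circular distance)."""
    d = np.abs(((codebook - azimuth_deg + 180) % 360) - 180)
    return int(np.argmin(d))
```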
We propose a study of the mathematical properties of voice as an audio signal. This work includes signals in which the channel conditions are not ideal for emotion recognition. Multi-resolution analysis (the discrete wavelet transform) was performed using the Daubechies wavelet family (Db1/Haar, Db6, Db8, Db10), decomposing the initial audio signal into sets of coefficients from which a set of features was extracted and analysed statistically in order to differentiate emotional states. Artificial neural networks (ANNs) proved to be a suitable classifier for these states. This study shows that the features extracted through wavelet decomposition are sufficient to analyse and extract emotional content from audio signals, yielding a high classification accuracy for emotional states without the need for other classical time-frequency features. Accordingly, this paper seeks to characterize mathematically the six basic emotions in humans: boredom, disgust, happiness, anxiety, anger and sadness, plus neutrality, for a total of seven states to identify....
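The wavelet-based feature extraction described here can be sketched with PyWavelets: a multi-level DWT with one of the Daubechies wavelets, followed by simple statistics per coefficient band. The specific statistics below (band energy, mean, standard deviation, zero-crossing rate) are assumptions for illustration and do not reproduce the paper's exact feature set.

```python
# Illustrative DWT feature extraction for emotion classification.
import numpy as np
import pywt

def dwt_emotion_features(signal, wavelet="db6", level=4):
    coeffs = pywt.wavedec(signal, wavelet, level=level)     # [cA_level, cD_level, ..., cD_1]
    feats = []
    for c in coeffs:
        zcr = np.mean(np.signbit(c[:-1]) != np.signbit(c[1:]))  # zero-crossing rate of the band
        feats.extend([np.sum(c ** 2), np.mean(c), np.std(c), zcr])
    return np.array(feats)

# These vectors would then feed an ANN classifier over the seven states,
# e.g. sklearn.neural_network.MLPClassifier, in place of classical time-frequency features.
```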
A new voice activity detection algorithm based on long-term pitch divergence is presented. The long-term pitch divergence not only decomposes speech signals with a bionic decomposition but also makes full use of long-term information. It is more discriminative than other feature sets, such as the long-term spectral divergence. Experimental results show that, among the six algorithms analysed, the proposed algorithm performs best, with the highest non-speech hit rate and a reasonably high speech hit rate....
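For context, the long-term spectral divergence (LTSD) mentioned as a comparison feature can be computed roughly as below: the long-term spectral envelope over a window of neighbouring frames is compared against a noise spectrum estimate and thresholded. This sketches only the reference feature; the proposed pitch-based divergence is not specified in the abstract and is not reproduced here. The window order and threshold are assumed values.

```python
# Rough LTSD-style VAD decision rule (reference feature, not the proposed method).
import numpy as np

def ltsd_vad(frames_spectra, noise_spectrum, order=6, threshold_db=6.0):
    """frames_spectra: (T, K) magnitude spectra; noise_spectrum: (K,) noise estimate."""
    T = frames_spectra.shape[0]
    decisions = np.zeros(T, dtype=bool)
    for t in range(T):
        lo, hi = max(0, t - order), min(T, t + order + 1)
        envelope = frames_spectra[lo:hi].max(axis=0)            # long-term spectral envelope
        ltsd = 10 * np.log10(np.mean(envelope ** 2 / noise_spectrum ** 2))
        decisions[t] = ltsd > threshold_db                       # speech if divergence is large
    return decisions
```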